Exploring edX data from 2012 to 2013 by Kan-Hua Lee

Basic information of the features

Number of rows and columns of the data

## [1] 641138     26

Number of unique users

## [1] 641138     30

Number of users by number of certificates earned:

## 
##      0      1      2      3      4      5 
## 613463  23966   2974    609     79     47

Univariate Plots Section

Registered users of each course

Certified users of each course

pass.rate of each course

Histogram of age among registered users

Histogram of all users by LoE_DI, with NA and blank ("") values filtered out

Histogram of gender among registered users

Plot the distribution of nevents
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Histogram of access.rate for users with certified == 0

Histogram of access.rate for users with certified == 1

Univariate Analysis

What is the structure of your dataset?

The data has 26 columns and 641,138 rows. Each row is the record of one user taking one course.

What is/are the main feature(s) of interest in your dataset?

We are mainly interested in the backgrounds and involvement of the students who attended each course and of those who earned its certificate. The features in this data set that we will focus on are: registered, certified, LoE_DI, YoB, gender, nevents and ndays_act.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Did you create any new variables from existing variables in the dataset?

The following new variables are created in the original dataframe edxdata:

  • age: the age of the user when taking the course, calculated as 2013 - YoB.
  • access.period: the number of days between start_time_DI and last_event_DI.
  • access.rate: ndays_act divided by access.period. This variable measures how often a user accesses the course.
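The three derived variables above can be sketched as follows. This is a minimal Python illustration with a made-up sample row, not the code used for this report. The ISO date format and the handling of missing values and non-positive periods are assumptions, and the names use underscores instead of dots.

```python
from datetime import date

def add_derived_fields(row):
    """Return a copy of row with age, access_period and access_rate added."""
    out = dict(row)

    # age = 2013 - YoB; left as None when the year of birth is missing
    yob = row.get("YoB")
    out["age"] = 2013 - yob if yob is not None else None

    # access_period = days between start_time_DI and last_event_DI
    # (assumption: both are ISO dates such as "2012-12-19")
    last = row.get("last_event_DI")
    if last:
        start = date.fromisoformat(row["start_time_DI"])
        period = (date.fromisoformat(last) - start).days
    else:
        period = None
    out["access_period"] = period

    # access_rate = ndays_act / access_period
    # (assumption: undefined when the period is missing or not positive)
    ndays = row.get("ndays_act")
    if period is not None and period > 0 and ndays is not None:
        out["access_rate"] = ndays / period
    else:
        out["access_rate"] = None
    return out

# Made-up row for illustration only
sample = {"YoB": 1988, "start_time_DI": "2012-12-19",
          "last_event_DI": "2013-01-18", "ndays_act": 6}
result = add_derived_fields(sample)
```

With this sample the user is 25 years old, the access period is 30 days, and the access rate is 6 / 30 = 0.2 active days per day.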

We also grouped the raw data by several features and created new variables in each of the resulting data sets:

users

This data frame groups and summarises the data by user.

  • course_taken: number of courses viewed.
  • total_registered: number of courses registered.
  • total_explored: number of courses explored.
  • user.certificates: number of courses certified.
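The per-user summary can be sketched as below. This is a hedged Python illustration, assuming each raw row holds one user in one course with 0/1 flags named registered, viewed, explored and certified and a user key named userid_DI; the sample rows are made up.

```python
from collections import defaultdict

def summarise_by_user(rows):
    """Sum the 0/1 flags of every (user, course) row for each user."""
    users = defaultdict(lambda: {
        "course_taken": 0,
        "total_registered": 0,
        "total_explored": 0,
        "user_certificates": 0,
    })
    for row in rows:
        u = users[row["userid_DI"]]
        u["course_taken"] += int(row["viewed"])
        u["total_registered"] += int(row["registered"])
        u["total_explored"] += int(row["explored"])
        u["user_certificates"] += int(row["certified"])
    return dict(users)

# Made-up rows for illustration only
rows = [
    {"userid_DI": "u1", "registered": 1, "viewed": 1, "explored": 1, "certified": 1},
    {"userid_DI": "u1", "registered": 1, "viewed": 1, "explored": 0, "certified": 0},
    {"userid_DI": "u2", "registered": 1, "viewed": 0, "explored": 0, "certified": 0},
]
users = summarise_by_user(rows)
```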

course_id

We grouped and summarised the data by course_id. The following new variables are created:

  • passed_num: total certified users of the course.
  • explored_num: total users who explored the course.
  • registered_num: total users who registered for the course.
  • total_nforum_posts: total number of posts in the course forum.
  • pass.rate: the number of certified users divided by the number of registered users.
  • hangon.rate: the number of users who explored the course divided by the number of registered users.
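A minimal sketch of the per-course summary and its two rates follows. It is written in Python with made-up rows; the column name nforum_posts and the choice to leave a rate undefined when a course has no registered users are assumptions.

```python
from collections import defaultdict

def summarise_by_course(rows):
    """Count users per course_id and derive pass_rate and hangon_rate."""
    courses = defaultdict(lambda: {
        "passed_num": 0,
        "explored_num": 0,
        "registered_num": 0,
        "total_nforum_posts": 0,
    })
    for row in rows:
        c = courses[row["course_id"]]
        c["passed_num"] += int(row["certified"])
        c["explored_num"] += int(row["explored"])
        c["registered_num"] += int(row["registered"])
        c["total_nforum_posts"] += row["nforum_posts"]

    for c in courses.values():
        reg = c["registered_num"]
        # Assumption: rates are undefined (None) when nobody registered
        c["pass_rate"] = c["passed_num"] / reg if reg else None
        c["hangon_rate"] = c["explored_num"] / reg if reg else None
    return dict(courses)

# Made-up rows for illustration only
rows = [
    {"course_id": "course_A", "registered": 1, "explored": 1, "certified": 1, "nforum_posts": 3},
    {"course_id": "course_A", "registered": 1, "explored": 1, "certified": 0, "nforum_posts": 2},
    {"course_id": "course_A", "registered": 1, "explored": 0, "certified": 0, "nforum_posts": 0},
    {"course_id": "course_A", "registered": 1, "explored": 0, "certified": 0, "nforum_posts": 0},
]
courses = summarise_by_course(rows)
```

Here one of four registered users is certified and two explored the course, so pass_rate is 0.25 and hangon_rate is 0.5.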

course

Some of the courses were offered more than once during 2012 and 2013. Therefore, another data frame is generated by grouping and summarising the raw data by course_code.

  • t_certified: number of total certified users.
  • t_viewed: number of users who viewed the course.
  • t_explored: number of users who explored the course.
  • t_registered: number of users who registered for the course.
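The grouping by course_code might look like the sketch below. How course_code is derived is not stated here, so this Python illustration assumes that course_id has the form institution/course_code/term and takes the middle part; the helper names and sample rows are hypothetical.

```python
from collections import Counter

def course_code(course_id):
    """Extract the course code, assuming 'institution/code/term' ids."""
    parts = course_id.split("/")
    # Fall back to the full id if the assumed format does not hold
    return parts[1] if len(parts) == 3 else course_id

def summarise_by_course_code(rows):
    """Sum the 0/1 flags over all offerings that share a course code."""
    totals = {}
    for row in rows:
        t = totals.setdefault(course_code(row["course_id"]), Counter())
        t["t_certified"] += int(row["certified"])
        t["t_viewed"] += int(row["viewed"])
        t["t_explored"] += int(row["explored"])
        t["t_registered"] += int(row["registered"])
    return totals

# Made-up rows for illustration only: two offerings of the same course
rows = [
    {"course_id": "MITx/6.002x/2012_Fall", "registered": 1, "viewed": 1, "explored": 1, "certified": 1},
    {"course_id": "MITx/6.002x/2013_Spring", "registered": 1, "viewed": 1, "explored": 0, "certified": 0},
    {"course_id": "MITx/6.002x/2013_Spring", "registered": 1, "viewed": 0, "explored": 0, "certified": 0},
]
totals = summarise_by_course_code(rows)
```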

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

A few preprocessing steps were applied to the raw data, as listed below:

  • Converted the feature LoE_DI into a factor with levels.
  • Converted the features certified, explored and viewed into logical type.
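The two preprocessing steps can be illustrated with the following Python sketch. The list of education levels and their order is an assumption about the values of LoE_DI, not something stated in this report, and the field LoE_level is a hypothetical name for the encoded level.

```python
# Assumed level names and order for LoE_DI (lowest to highest)
LOE_LEVELS = ["Less than Secondary", "Secondary", "Bachelor's",
              "Master's", "Doctorate"]

def preprocess(row):
    """Encode LoE_DI as a level index and turn 0/1 flags into booleans."""
    out = dict(row)

    # Unknown, blank or missing education levels become None
    loe = row.get("LoE_DI")
    out["LoE_level"] = LOE_LEVELS.index(loe) if loe in LOE_LEVELS else None

    # Flags may arrive as 0/1 integers or as "0"/"1" strings
    for flag in ("certified", "explored", "viewed"):
        out[flag] = bool(int(row[flag]))
    return out

# Made-up row for illustration only
clean = preprocess({"LoE_DI": "Master's", "certified": "1",
                    "explored": 1, "viewed": 0})
```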

Bivariate Plots Section

Histogram of certified users by LoE_DI

Histogram of certified users by gender

Distribution of nevents of certified users only

Boxplot of nevents against course

Boxplot of ndays_act against course

Boxplot of access.period for each course

Mean and median number of events (nevents) of certified users, by course

## Source: local data frame [13 x 3]
## 
##    course_code mean_nevents med_nevents
## 1        CB22x    2752.8906      2292.0
## 2        CS50x     284.1166       236.5
## 3        ER22x    1387.1113      1338.5
## 4       PH207x    6144.8561      5465.0
## 5       PH278x    1739.2003      1496.0
## 6       6.002x    5353.4035      4481.5
## 7        2.01x    5312.8340      4826.0
## 8       14.73x    4797.0168      4508.5
## 9       3.091x    7269.2455      6377.5
## 10       6.00x    7953.6741      7227.5
## 11       7.00x    5942.5189      5548.0
## 12       8.02x    9678.3005      9054.5
## 13      8.MReV    7108.4949      6568.0
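A summary of this shape can be computed as sketched below. This is a Python illustration with made-up rows, not the code that produced the table above; skipping rows with a missing nevents value is an assumption.

```python
from collections import defaultdict
from statistics import mean, median

def nevents_summary(rows):
    """Mean and median nevents of certified users for each course_code."""
    by_course = defaultdict(list)
    for row in rows:
        # Keep certified users only; skip rows without an event count
        if row["certified"] and row["nevents"] is not None:
            by_course[row["course_code"]].append(row["nevents"])
    return {
        code: {"mean_nevents": mean(values), "med_nevents": median(values)}
        for code, values in by_course.items()
    }

# Made-up rows for illustration only
rows = [
    {"course_code": "X1", "certified": True, "nevents": 100},
    {"course_code": "X1", "certified": True, "nevents": 300},
    {"course_code": "X1", "certified": True, "nevents": 500},
    {"course_code": "X1", "certified": True, "nevents": 900},
    {"course_code": "X1", "certified": False, "nevents": 10},
    {"course_code": "Y2", "certified": True, "nevents": 50},
]
summary = nevents_summary(rows)
```

In the made-up course X1 the mean (450) exceeds the median (400) because one heavy user pulls the mean up. The table above shows the same pattern for every course, which suggests right-skewed distributions of nevents.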

Distribution of LoE_DI among certified users

Distribution of LoE_DI among all users

ndays_act vs. nevents among the users who have explored more than half of the course:

access.rate vs. grade

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Plots Section

Plot the distribution of nevents

Plot the distribution of nevents of certified users only

Plot the distribution of access.rate of certified users:

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection